Which Risks Are Needed?

Introduction

Our study asks participants to rate how risky they perceive certain risks to be. The “Risk Group” has decided to take a more systematic approach to filtering out the risks that are worth investigating. Drawing on several sources, including the Basel Risk Norms and two separate scoping reviews (one from our Risk Polarization group, the other from Amanda), they have listed around 100 risks and labeled them by domain (such as health, finances, political, crime and nature). To be as efficient as possible, however, we have to choose which risks are worth asking about in the first place, since some risks are so similar to one another that including all of them would be redundant.

One way to do this is to ask humans to rate the risks or sort them into clusters/domains themselves. Since this takes time and money, this project instead tries to leverage embeddings to do the clustering and mapping. A huge shout-out goes to the R package embedR and its author, Dirk Wulff. The package makes working with embeddings in R very easy, and the pipelines provided on his GitHub page are a generous addition.

Method & Results

Here is a list of risks we are working with:

Embeddings

The data has two columns: the name of the domain ("Domain/Label") and the risk ("Risk/Items"), which is the target of the embedding analysis. Using the er_embed() function, we can embed the items. We will be using the default all-mpnet-base-v2 model from Hugging Face, a lightweight sentence-embedding model that returns a 768-dimensional vector for each item.

We can now inspect the embeddings for each risk (only the first five items and dimensions are shown):

## 
## embedR object
## 
## Embedding: 99 objects and 768 dimensions.
## 
## 
## Embedding
## 
##                          [,1]       [,2]         [,3]        [,4]        [,5]
## health            0.008797954 0.09673023  0.010030607 -0.01816359  0.03574587
## wildfires        -0.025626084 0.06679159  0.017190674 -0.03250840 -0.01195548
## trust government -0.023342272 0.14670171 -0.006135981  0.03572665  0.00765732
## terrorism        -0.014037776 0.02034166  0.004858475 -0.03388666 -0.06702958
## unemployment     -0.017772257 0.07592832  0.007619646 -0.03449840 -0.01949257

Reducing the embeddings to two dimensions makes them easier to visualize. The following plot is colored by the “Team Risks” domain labels. Most of the items fall well within their respective labels, but there are still outliers.
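The report does not state which reduction method produced the plot, so the following is only a minimal sketch of the general idea, assuming a PCA via base R's prcomp(). The matrix `emb` and the `domain` labels are synthetic stand-ins, not the real risk embeddings. The sketch also reports how much variance the two plotted dimensions retain, which is a useful sanity check before reading too much into apparent overlap.

```r
# Minimal sketch (assumption: PCA as the reduction method; synthetic data).
set.seed(1)
n_items <- 20
n_dims  <- 50  # the real embeddings have 768 dimensions

# Synthetic stand-in for the embedding matrix: one row per risk item.
emb <- matrix(rnorm(n_items * n_dims), nrow = n_items,
              dimnames = list(paste0("risk_", seq_len(n_items)), NULL))

# Hypothetical domain labels, four items per domain.
domain <- factor(rep(c("health", "finances", "political", "crime", "nature"),
                     each = 4))

# Principal component analysis on the centered embeddings.
pca <- prcomp(emb, center = TRUE, scale. = FALSE)

# Coordinates of every item on the first two components.
coords <- pca$x[, 1:2]

# Share of total variance retained by the two plotted dimensions.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
retained <- sum(var_explained[1:2])

# Scatter plot colored by domain label.
plot(coords, col = as.integer(domain), pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(domain),
       col = seq_along(levels(domain)), pch = 19)
```

If `retained` is small, items that look close in the plot may still be far apart in the full embedding space.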

Instead of comparing each risk to the others visually, we can calculate the cosine similarity, which gives us a numerical measure of how similar two risks are. This also helps us choose which risks to include: we don't want items that are too similar, as they would be redundant.
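As an illustration of the computation, here is a small base R sketch of pairwise cosine similarity. The helper `cosine_similarity()` and the three-dimensional toy vectors are made up for this example and are not the actual embeddings. The last step ranks item pairs from most to least similar, which is one way redundant candidates could be spotted.

```r
# Minimal sketch: pairwise cosine similarity in base R (toy data).
cosine_similarity <- function(x) {
  # x: numeric matrix with one row per item
  norms <- sqrt(rowSums(x^2))  # Euclidean length of each row
  unit  <- x / norms           # scale every row to unit length
  tcrossprod(unit)             # dot products of all row pairs
}

# Toy vectors standing in for the 768-dimensional embeddings.
emb <- rbind(
  wildfires    = c(1.0, 0.0, 1.0),
  floods       = c(0.9, 0.1, 1.0),
  unemployment = c(0.0, 1.0, 0.1)
)

sim <- cosine_similarity(emb)

# List each unordered pair once and sort by decreasing similarity.
pairs <- which(upper.tri(sim), arr.ind = TRUE)
ranked <- data.frame(
  item_a     = rownames(sim)[pairs[, "row"]],
  item_b     = colnames(sim)[pairs[, "col"]],
  similarity = sim[pairs]
)
ranked <- ranked[order(-ranked$similarity), ]
print(ranked)
```

Pairs at the top of `ranked` are the ones where keeping both items is least likely to add information.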

How many Domains?

One could ask whether these 5 domains accurately describe our risks. Luckily, thanks to the embeddings, we now have a similarity matrix, which makes k-means clustering available. The following plots are generated with different numbers of clusters. As a rule of thumb, the higher the F-statistic (the ratio of the between-cluster sum of squares to the within-cluster sum of squares, each divided by its degrees of freedom), the better the clusters are separated.
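How the F-values in this report were computed is not shown, so the sketch below is an assumption: it uses base R's kmeans() and the Calinski-Harabasz pseudo-F, i.e. between-cluster sum of squares over (k - 1), divided by within-cluster sum of squares over (n - k). The helper `pseudo_f()` and the synthetic matrix `emb` are hypothetical stand-ins for the real pipeline and data.

```r
# Minimal sketch (assumption: Calinski-Harabasz pseudo-F; synthetic data).
set.seed(42)

# Synthetic stand-in for embeddings: three separated groups in 10 dimensions.
emb <- rbind(
  matrix(rnorm(20 * 10, mean = 0), ncol = 10),
  matrix(rnorm(20 * 10, mean = 3), ncol = 10),
  matrix(rnorm(20 * 10, mean = 6), ncol = 10)
)

pseudo_f <- function(x, k) {
  # Several random starts reduce the risk of a poor local optimum.
  fit <- kmeans(x, centers = k, nstart = 25)
  n <- nrow(x)
  # Between-cluster variance divided by within-cluster variance.
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

# Compare a range of candidate cluster counts.
ks <- 2:8
f_values <- sapply(ks, function(k) pseudo_f(emb, k))
names(f_values) <- ks
print(round(f_values, 1))

# Candidate with the highest pseudo-F.
best_k <- ks[which.max(f_values)]
```

Because the statistic is computed on the full-dimensional vectors, it can be compared with what the two-dimensional plots suggest.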

WIP: more clusters?
F-values are very similar across all numbers of clusters…
Is the overlap due to the 768 dimensions? This plot only uses a reduction to 2 dimensions, where information gets lost…?

Discussion

Limitations

Conclusion

Credits

Acknowledgements

References

R Packages Used